The Boston dataset explores correlations between the standard of living, housing quality, and other criteria such as the urban environment, accessibility, and demographics within the Boston area.
Each row represents data for a specific town and has information about various factors such as crime rate, proportion of non-retail business acres, property tax rates, pupil-teacher ratios, etc.
This data frame contains the following columns:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk − 0.63)^2, where Bk is the proportion of Black residents by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
Explore the structure and the dimensions of the dataset. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, the distributions of the variables and the relationships between them.
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyr)
setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")
# copy the MASS Boston dataset into the workspace
Boston <- Boston
# explore the mean, median, min, max per variable
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# 506 entries with 14 variables
dim(Boston)
## [1] 506 14
# Data type: all num or int
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# let's create a pairs plot of all the variables -> not easy to read, as there are too many variables
pairs(Boston)
# Let's analyze the histogram for each variable:
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Let's compute a correlation matrix
matrix_correlation_var <- cor(Boston)
# Visualize correlation matrix as a heatmap
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
ggplot(data = melt(matrix_correlation_var), aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0)
According to the correlation matrix, there are negative relationships between certain variables:
lstat and medv - lower status of population and the median price of homes owned by occupants
lstat and rm - lower status of population and the average of room numbers per home
tax and medv - the percentage at which a property is taxed based on its assessed value and the median price of homes owned by occupants
dis and lstat - the weighted average distance from each house to these employment centers and the lower status of population
dis and age - the weighted average distance from each house to these employment centers and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940.
dis and nox - the weighted average distance from each house to these employment centers and nitrogen oxides concentration.
dis and indus - the weighted average distance from each house to these employment centers and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).
tax and dis - the percentage at which a property is taxed based on its assessed value and the weighted average distance from each house to these employment centers
zn and age - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940
zn and nox - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the nitrogen oxides concentration.
zn and indus - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).
Positive relationships between certain variables:
medv and rm - the median price of homes owned by occupants and the average of room numbers per home
tax and indus - the percentage at which a property is taxed based on its assessed value and the proportion of non-retail business acres per town.
tax and nox - the percentage at which a property is taxed based on its assessed value and the nitrogen oxides concentration.
age and indus - the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the proportion of non-retail business acres per town.
age and nox - the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the nitrogen oxides concentration.
nox and indus - the nitrogen oxides concentration and the proportion of non-retail business acres per town.
-> variables to test in my models: lstat, medv, rm, tax, dis, age, nox, indus, zn
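The pairs called out above can also be extracted programmatically by ranking the entries of the correlation matrix. A small sketch (not part of the assignment) that keeps each pair once and sorts by absolute correlation:

```r
library(MASS)

# rank variable pairs by absolute correlation
cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA        # keep each pair only once
pairs_df <- na.omit(as.data.frame(as.table(cm)))
names(pairs_df) <- c("var1", "var2", "correlation")
head(pairs_df[order(-abs(pairs_df$correlation)), ], 5)
```

The top of this list matches the visual impression from the heatmap, with rad–tax among the strongest positive correlations.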
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
model1 <- glm(lstat ~ medv + dis + rm,data=Boston)
model2 <- glm(medv ~ rm + tax + lstat ,data=Boston)
# all the VIFs are between 1 and 2.1, which is fine: this suggests low multicollinearity and implies that the variance of the estimated coefficients is not significantly inflated by collinearity
vif(model1)
## medv dis rm
## 1.982257 1.068810 1.940168
vif(model2)
## rm tax lstat
## 1.610953 1.426005 2.092901
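As a sanity check on what car::vif() reports, the VIF of a predictor can be computed by hand as 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors in the model. A sketch for medv in model1:

```r
library(MASS)

# VIF by hand: for model1 (lstat ~ medv + dis + rm),
# the VIF of medv comes from regressing medv on dis and rm
r2 <- summary(lm(medv ~ dis + rm, data = Boston))$r.squared
vif_medv <- 1 / (1 - r2)
round(vif_medv, 3)  # matches car::vif(model1)["medv"], about 1.98
```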
# let's exponentiate the coefficients and their 95% confidence intervals.
# Caution: exponentiated coefficients are odds ratios only for logistic models; model1 and model2 use a Gaussian family, so these values should be interpreted with care.
OR1 <- coef(model1) %>% exp
OR2 <- coef(model2) %>% exp
CI1 <- confint(model1) %>% exp
## Waiting for profiling to be done...
CI2 <- confint(model2) %>% exp
## Waiting for profiling to be done...
# when the confidence interval for an exponentiated coefficient doesn't span 1 (i.e., the raw coefficient's interval doesn't span 0), the effect is statistically significant; this holds for every predictor in both models (though not for model2's intercept)
cbind(OR1, CI1)
## OR1 2.5 % 97.5 %
## (Intercept) 1.764803e+16 4.023272e+14 7.741285e+17
## medv 6.606608e-01 6.250484e-01 6.983022e-01
## dis 3.292580e-01 2.756488e-01 3.932934e-01
## rm 1.682751e-01 8.210665e-02 3.448749e-01
cbind(OR2, CI2)
## OR2 2.5 % 97.5 %
## (Intercept) 0.6073487 0.001289619 286.0319984
## rm 181.1868455 76.546116788 428.8744403
## tax 0.9935198 0.990167804 0.9968832
## lstat 0.5754723 0.522466602 0.6338557
# the residual deviance is much smaller than the null deviance, which indicates a reasonably good fit of the model to the data.
summary(model1)
##
## Call:
## glm(formula = lstat ~ medv + dis + rm, data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.40940 1.92918 19.391 < 2e-16 ***
## medv -0.41451 0.02827 -14.662 < 2e-16 ***
## dis -1.11091 0.09067 -12.252 < 2e-16 ***
## rm -1.78215 0.36612 -4.868 1.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 17.22406)
##
## Null deviance: 25752.4 on 505 degrees of freedom
## Residual deviance: 8646.5 on 502 degrees of freedom
## AIC: 2882.2
##
## Number of Fisher Scoring iterations: 2
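The null-vs-residual deviance comparison can be summarized as a deviance-based R²: the share of the null deviance removed by the predictors (for a Gaussian GLM this equals the ordinary R² of the corresponding lm). A quick sketch:

```r
library(MASS)

# deviance-based R^2: 1 - residual deviance / null deviance
model1 <- glm(lstat ~ medv + dis + rm, data = Boston)
r2_dev <- 1 - model1$deviance / model1$null.deviance
round(r2_dev, 3)  # about 0.664, given the deviances printed above
```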
# medv and rm also influence each other, so let's add their interaction to the model
model11 <- glm(lstat ~ medv + dis + rm + rm * medv ,data=Boston)
summary(model11)
##
## Call:
## glm(formula = lstat ~ medv + dis + rm + rm * medv, data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.41052 3.64466 20.42 <2e-16 ***
## medv -1.94316 0.13517 -14.38 <2e-16 ***
## dis -0.89569 0.08285 -10.81 <2e-16 ***
## rm -7.55541 0.59817 -12.63 <2e-16 ***
## medv:rm 0.22526 0.01957 11.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 13.64918)
##
## Null deviance: 25752.4 on 505 degrees of freedom
## Residual deviance: 6838.2 on 501 degrees of freedom
## AIC: 2765.5
##
## Number of Fisher Scoring iterations: 2
# same here: the residual deviance is much smaller than the null deviance.
summary(model2)
##
## Call:
## glm(formula = medv ~ rm + tax + lstat, data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.498652 3.140239 -0.159 0.873895
## rm 5.199529 0.439618 11.827 < 2e-16 ***
## tax -0.006501 0.001724 -3.770 0.000182 ***
## lstat -0.552564 0.049302 -11.208 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 29.90866)
##
## Null deviance: 42716 on 505 degrees of freedom
## Residual deviance: 15014 on 502 degrees of freedom
## AIC: 3161.4
##
## Number of Fisher Scoring iterations: 2
Standardizing the data. Variables often have different scales and units, which makes direct comparison difficult. Standardization brings all variables to a common scale, with a mean of 0 and a standard deviation of 1. This allows fair comparisons between variables, aids the interpretation of coefficients and of the relative importance of predictors in regression, and prevents any single variable from dominating scale-sensitive techniques simply because of its units. Finally, centering and scaling can mitigate collinearity between a variable and terms derived from it (such as interactions), reducing the impact of differing scales on the regression coefficients.
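The z-score transformation behind scale() is simply (x − mean(x)) / sd(x), applied column by column. A quick check that the manual formula and scale() agree, using the crim variable:

```r
library(MASS)

# manual z-score vs scale(): they should be identical
z_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
z_scale  <- scale(Boston$crim)[, 1]
all.equal(z_manual, z_scale)  # TRUE
```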
# select numerical values
Boston_numeric_cols <- Boston[, sapply(Boston, is.numeric)]
# scale() standardizes the data to have a mean of 0 and a standard deviation of 1.
scaled_boston <- scale(Boston_numeric_cols)
# convert to a data frame
scaled_table_boston <- as.data.frame(scaled_boston)
# how did the data change? The mean of every variable is now 0, so it has worked.
summary(scaled_table_boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# use the cut() function to create a categorical variable from breaks in a numerical one. We do this for crim, using its quartiles (probabilities 0, 0.25, 0.50, 0.75, 1) as breaks, and add a label for each category.
# include.lowest = TRUE is to ensure there is no NA category.
quantiles <- quantile(Boston$crim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
interval_labels <- c("premier_quantil", "second_quantil", "third_quantil", "fourth_quantil")
scaled_table_boston$quantiles_crime <- cut(Boston$crim, quantiles, labels= interval_labels,include.lowest = TRUE)
# Notes: quantiles derived from numeric values can be treated as categorical or continuous. When quantiles define discrete bins that divide a continuous variable into distinct groups, they are treated as categorical (e.g., small, medium, big). When quantiles represent numeric values indicating a position relative to the distribution (e.g., the 25th percentile, median, 75th percentile), they are treated as continuous.
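Because the breaks are the sample quartiles, each category should hold roughly a quarter of the 506 towns, and include.lowest = TRUE keeps the minimum out of an NA bin. A quick self-contained check:

```r
library(MASS)

# quartile breaks should put ~25% of towns in each bin
q <- quantile(Boston$crim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
crim_cat <- cut(Boston$crim, breaks = q,
                labels = c("premier_quantil", "second_quantil",
                           "third_quantil", "fourth_quantil"),
                include.lowest = TRUE)  # keeps the minimum value in the first bin
table(crim_cat)
```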
# drop the former column crim and create a new table
Boston_new <- scaled_table_boston
# For some reason, this does not knit via the index file when I include the "select" call; knitting works from chapter 4 itself with or without it, so I removed it to be able to knit via index.Rmd.
# In order to remove crim, I would do as below:
#%>%select(-crim, everything())
# We need 80% of the total rows for training
train_size <- round(0.8 * nrow(Boston_new))
# Randomly sample 80% of the row indices
train_set <- sample(seq_len(nrow(Boston_new)), size = train_size)
# Create training and testing subsets
train_data <- Boston_new[train_set, ]
test_data <- Boston_new[-train_set, ]
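Note that sample() without a seed produces a different split on every knit. A seeded version of the same 80/20 split (the seed value is arbitrary and shown only for reproducibility):

```r
# reproducible 80/20 split of 506 rows (seed value is illustrative)
set.seed(123)
n <- 506                                   # nrow(Boston_new)
train_idx <- sample(seq_len(n), size = round(0.8 * n))
length(train_idx)  # 405 training rows; the remaining 101 form the test set
```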
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
Linear Discriminant Analysis (LDA) can be quite useful in real-life scenarios where classification or dimensionality reduction is required, such as marketing and customer segmentation, product quality control, and credit scoring.
It is ideal when we have multiple variables that can help us assign each entry to one specific group whose members share similar means and variances. The goal is to predict the categorical outcome or class based on a set of predictors.
LDA transforms the original variables into a smaller set of new variables, known as linear discriminants. These new variables are constructed to maximize the separation between the different categories or classes in the data. By using the information captured in the linear discriminants, LDA can accurately assign new observations to their respective categories. In short, LDA combines the original variables into new ones that best capture the differences between groups, and then uses that representation to classify new data, making it useful both for simplifying data and for predicting categories.
LDA calculates the mean and variance for each predictor within each class or category in the target variable. It finds coefficients for these combinations by considering differences in means between classes and assumes equal variances for each predictor within classes. It assumes that the data approximately follows a normal distribution.
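These mechanics can be illustrated on the built-in iris data (a separate sketch, not the assignment data): lda() estimates class-wise means, the scaling coefficients of the linear discriminants, and how much between-group variance each discriminant captures.

```r
library(MASS)

# minimal LDA illustration on iris
fit <- lda(Species ~ ., data = iris)
fit$scaling                              # coefficients of the linear discriminants
prop_trace <- fit$svd^2 / sum(fit$svd^2) # proportion of between-group variance per LD
round(prop_trace, 3)                     # LD1 dominates, as in the Boston model below
```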
# check the impact of each variable on a categorical variable (quantiles_crime)
# each quantile group holds approximately 25% of the observations
# LD1 explains about 96% of the between-group variance. When LD1 captures that much of the variance, it effectively separates the classes based on their mean differences; along LD1, data points are grouped so as to maximize the separation between classes while minimizing the variation within each class.
# Positive coefficients indicate that higher values of a variable contribute to a higher score on that discriminant, while negative coefficients suggest the opposite. The larger the magnitude of the coefficient, the greater the variable's impact on the discriminant.
library(MASS)
# Fit LDA on the train set - LDA is a statistical method used for classification. It fits an LDA model using the train_data, where quantiles_crime is the target variable to predict based on the other variables
lda_model <- lda(quantiles_crime ~ ., data = train_data)
# Predict on the test set by using the function predict to apply the model created from the train_data to test_data -> the idea here is to predict the class labels for this test_data
lda_pred <- predict(lda_model, test_data)$class # class extract the predicted class labels
# actual crime quantiles
actual_crime_categories <- test_data$quantiles_crime
# Extract LD scores from the LDA model's predictions
lda_scores <- predict(lda_model, test_data)$x # $x gives the matrix of discriminant scores for each observation (posterior probabilities are in $posterior)
# Create a dataframe with LD scores from first 2 columns of the lda_score and the predicted classes. Combining LD1, LD2, and the predicted classes for visualization. In many cases, visualizing beyond LD1 and LD2 might become complex to display effectively in two-dimensional plots. LD1 and LD2 are typically chosen for visualization as they capture the most discrimination power while allowing for a clearer visualization on a 2D plot.
plot_data <- data.frame(LD1 = lda_scores[, 1], LD2 = lda_scores[, 2], prediction_crime_quantile = as.factor(lda_pred))
plot_data
## LD1 LD2 prediction_crime_quantile
## 1 -3.6981646 -0.21588536 premier_quantil
## 11 -1.8549894 -0.64235666 second_quantil
## 13 -2.3093229 0.08193752 second_quantil
## 16 -2.3252314 -0.31652410 second_quantil
## 21 -1.6342913 -0.76131683 second_quantil
## 30 -1.9693603 -0.90333329 third_quantil
## 31 -1.6704217 -0.78220979 second_quantil
## 37 -2.0363588 0.04951612 second_quantil
## 39 -2.3337243 0.24743348 second_quantil
## 53 -3.2896756 0.95744486 premier_quantil
## 61 -0.9979333 0.94196457 second_quantil
## 66 -3.2474880 3.21818139 premier_quantil
## 68 -3.2552531 1.07462253 second_quantil
## 77 -2.2219250 -0.04031855 second_quantil
## 87 -3.1225147 0.17517948 second_quantil
## 89 -3.2378959 -0.59639064 second_quantil
## 91 -3.2968089 -0.25352905 second_quantil
## 92 -3.2449264 -0.31929177 second_quantil
## 105 -1.6046883 -0.44975103 second_quantil
## 110 -1.5349080 -0.52304057 second_quantil
## 112 -1.5195115 -0.64414509 third_quantil
## 118 -1.4967890 -0.45005177 second_quantil
## 119 -1.3544853 -0.44841457 second_quantil
## 124 -2.0176217 -1.74719206 third_quantil
## 125 -2.2070971 -1.64017150 third_quantil
## 129 -1.4523991 -1.60548475 third_quantil
## 134 -1.4237891 -1.57518774 third_quantil
## 137 -1.4059599 -1.51141894 third_quantil
## 138 -1.4954072 -1.55873000 third_quantil
## 140 -1.3466158 -1.59380356 third_quantil
## 156 -0.7346977 -3.37308048 third_quantil
## 160 -0.6718932 -3.18898528 third_quantil
## 162 -1.3778419 -2.37158560 third_quantil
## 172 -1.4785048 -1.35654558 third_quantil
## 174 -1.9962113 -0.36433577 second_quantil
## 186 -2.5901737 -0.50733454 second_quantil
## 188 -2.5305333 1.58997730 premier_quantil
## 193 -2.8098884 1.33791225 premier_quantil
## 194 -4.3970014 2.11278193 premier_quantil
## 201 -3.4622306 3.08281738 premier_quantil
## 204 -2.6170319 2.41200852 premier_quantil
## 205 -2.6414814 2.36819724 premier_quantil
## 211 -2.3140662 -1.24044828 second_quantil
## 214 -2.7162027 -0.11001481 second_quantil
## 215 -2.3689587 0.09400501 second_quantil
## 218 -1.7323103 -1.21114146 third_quantil
## 228 -0.8854793 -0.44542675 third_quantil
## 241 -2.1901619 1.06814685 premier_quantil
## 243 -2.1389418 1.05757895 second_quantil
## 244 -2.7232748 1.70809598 premier_quantil
## 247 -1.8686151 1.01695278 second_quantil
## 251 -2.2085227 1.39465062 premier_quantil
## 261 -1.4927185 -1.07797310 third_quantil
## 262 -1.3102810 -1.52403537 third_quantil
## 264 -1.3895743 -1.17887954 third_quantil
## 266 -1.6882534 -0.22011015 third_quantil
## 283 -2.7350460 -0.54098416 second_quantil
## 287 -3.9499116 2.56895190 premier_quantil
## 292 -2.6525536 2.56893476 premier_quantil
## 296 -3.2342182 -0.08973777 second_quantil
## 306 -1.2898497 1.18528405 premier_quantil
## 311 -2.3639719 0.21012922 second_quantil
## 314 -2.1865934 -0.74299419 second_quantil
## 317 -1.9814437 -0.85310588 second_quantil
## 328 -2.2402841 -0.07105517 second_quantil
## 332 -4.1362653 1.27502515 premier_quantil
## 337 -2.1865592 0.01719251 second_quantil
## 341 -2.0898174 -0.06634654 second_quantil
## 343 -4.0671276 -0.57514751 second_quantil
## 344 -2.2082679 1.48784989 premier_quantil
## 350 -4.1116672 0.93556172 premier_quantil
## 354 -2.5554844 2.40411964 premier_quantil
## 355 -2.8506045 2.91735230 premier_quantil
## 359 5.7819841 -0.99339607 fourth_quantil
## 360 6.0943657 -0.48113829 fourth_quantil
## 371 5.7952693 -1.05182593 fourth_quantil
## 373 6.2162505 -1.15799680 fourth_quantil
## 379 6.4117941 0.46092310 fourth_quantil
## 383 6.2814224 0.17927658 fourth_quantil
## 384 6.3028241 0.11435689 fourth_quantil
## 392 6.0873817 -0.13169591 fourth_quantil
## 406 7.4276821 1.24639183 fourth_quantil
## 408 6.3471246 0.07786106 fourth_quantil
## 410 6.3877914 0.13568874 fourth_quantil
## 427 5.6983160 1.43541718 fourth_quantil
## 432 5.9955283 0.62587776 fourth_quantil
## 446 6.6916877 -0.34290394 fourth_quantil
## 450 6.1794826 -0.10222327 fourth_quantil
## 453 5.9912065 -0.09371909 fourth_quantil
## 454 6.0353461 -0.34815168 fourth_quantil
## 458 6.3435527 0.01689029 fourth_quantil
## 459 6.0315389 0.01170339 fourth_quantil
## 462 5.8714359 -0.10535634 fourth_quantil
## 471 5.4828233 0.68175899 fourth_quantil
## 473 5.3706256 0.68524014 fourth_quantil
## 480 5.8574602 0.65731836 fourth_quantil
## 481 5.0636158 1.20428580 fourth_quantil
## 483 5.0398335 0.94320653 fourth_quantil
## 499 -1.3238090 -0.53194343 third_quantil
## 500 -1.2091825 -0.49009274 third_quantil
## 503 -3.0093031 -1.01166402 second_quantil
# Create a scatterplot of LD1 and LD2
plot_LDA <- ggplot(plot_data, aes(x = LD1, y = LD2, color = prediction_crime_quantile)) +
geom_point() +
labs(title = "LDA Biplot with Predicted Crime Quantiles")
plot_LDA
# adding real values - comparison of actual vs predicted values in test_data
realVsPred_plot <- plot_LDA +
geom_point(aes(color = actual_crime_categories), size = 4, alpha = 0.1) +
labs(color = "Real Quantiles of Crime")
realVsPred_plot
# the accuracy of predictions using test data
accuracy <- mean(lda_pred == test_data$quantiles_crime)
print(paste("Accuracy of LDA model on test data:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy of LDA model on test data: 69.31 %"
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)
# save the crime data:
actual_crime_categories <- test_data$quantiles_crime
# Remove the categorical crime variable from the test dataset
test_data_without_crime <- subset(test_data, select = -c(quantiles_crime))
# get the predicted classes from the model - this was already done earlier with lda_pred, so this step repeats it.
lda_pred_test <- predict(lda_model, test_data_without_crime)$class
# cross-tabulate predictions vs actual values - the results are identical both ways since I performed the same steps twice. I might have missed something in the instructions.
cross_tab <- table(Predicted = lda_pred_test, Actual = actual_crime_categories)
cross_tab
## Actual
## Predicted premier_quantil second_quantil third_quantil fourth_quantil
## premier_quantil 16 4 0 0
## second_quantil 11 15 9 0
## third_quantil 1 5 15 0
## fourth_quantil 0 0 1 24
cross_table <- table(Predicted = lda_pred, Actual = actual_crime_categories)
cross_table
## Actual
## Predicted premier_quantil second_quantil third_quantil fourth_quantil
## premier_quantil 16 4 0 0
## second_quantil 11 15 9 0
## third_quantil 1 5 15 0
## fourth_quantil 0 0 1 24
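The accuracy reported earlier can be recovered directly from a confusion matrix like the one above: correct predictions sit on the diagonal. A toy sketch (the vectors are illustrative, not the model's predictions):

```r
# accuracy = sum of the diagonal / total count of the confusion matrix
pred   <- factor(c("low", "low", "high", "high", "low"))
actual <- factor(c("low", "high", "high", "high", "low"))
tab <- table(Predicted = pred, Actual = actual)
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 4 of the 5 pairs agree -> 0.8
```

Applied to cross_tab above, sum(diag(cross_tab)) / sum(cross_tab) reproduces the ~69% accuracy.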
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
# Reload the Boston dataset
Boston_load <- Boston
# Scale the Boston data
Boston_scaled <- scale(Boston_load)
# Calculate distances between the scaled data points:
distances <- dist(Boston_scaled)
# Visualize the distances using fviz_dist()
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_dist(distances, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# k-means: the k-means algorithm groups observations into clusters based on their similarity, aiming to minimize the variation within clusters while maximizing the variation between clusters. We need to choose the number of clusters before running the k-means analysis.
# Elbow method to find the optimal number of clusters: calculate the within-cluster sum of squares (WSS) for different values of k, and choose the k at which the decrease in WSS levels off. In the plot of WSS versus k, this appears as an elbow.
# The squared error for each point is the squared distance from the point to its assigned cluster centre.
# The WSS score is the sum of these squared errors over all points.
wss <- numeric(10) # Initialize within-cluster sum of squares vector
for (i in 1:10) {
kmeans_model <- kmeans(Boston_scaled, centers = i, nstart = 10)
wss[i] <- kmeans_model$tot.withinss # Store within-cluster sum of squares
}
# Plotting the Elbow Method
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares", main = "Elbow Method")
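As a check on what tot.withinss measures, it can be recomputed by hand from the cluster assignments: the squared distance of each point to its own cluster centre, summed over all points. A sketch on the scaled Boston data (seed is arbitrary, only to make the check reproducible):

```r
library(MASS)

# recompute tot.withinss manually from assignments and centres
Boston_scaled <- scale(Boston)
set.seed(1)
km <- kmeans(Boston_scaled, centers = 2, nstart = 10)
wss_manual <- sum((Boston_scaled - km$centers[km$cluster, ])^2)
all.equal(wss_manual, km$tot.withinss)  # TRUE
```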
# or we can use the direct function for the Elbow method
kmeans_model_optimal2 <- fviz_nbclust(Boston_scaled, kmeans, method = "wss")
# or we can use the silhouette method
fviz_nbclust(Boston_scaled, kmeans, method = "silhouette")
# Quick pairs() view coloured by cluster (note: kmeans_model here is still the last model fitted in the loop above, with 10 centers)
pairs(Boston_scaled, col = kmeans_model$cluster)
k <- 2 # 2 seems to be the best option according to the Elbow Method and the silhouette method
# k-means model:
kmeans_model <- kmeans(Boston_scaled, centers = k, nstart = 25)
cluster_assignments <- kmeans_model$cluster
# visualize the clusters thanks to fviz_cluster function
fviz_cluster(kmeans_model, data = Boston_scaled)
library(GGally)
# Combine the scaled data with the cluster assignments (cluster 1 or 2)
clustered_data <- cbind(as.data.frame(Boston_scaled), Cluster = as.factor(cluster_assignments))
# Visualize clusters using ggpairs()
ggpairs(clustered_data, aes(color = Cluster))
## Warning in cor(x, y): the standard deviation is zero
## (warning repeated for each affected panel)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated for each histogram panel)
# Mean for each cluster and variable:
clustered_data %>%
mutate(Cluster = kmeans_model$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 2 × 15
## Cluster crim zn indus chas nox rm age dis rad
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.389 0.262 -0.615 0.00291 -0.582 0.245 -0.433 0.454 -0.583
## 2 2 0.724 -0.487 1.14 -0.00541 1.08 -0.455 0.805 -0.844 1.08
## # ℹ 5 more variables: tax <dbl>, ptratio <dbl>, black <dbl>, lstat <dbl>,
## # medv <dbl>